Max Piazza
David Orona
Nithya Arumugam
Abhitej Bokka

Data Analysis Project: Modeling Used Car Prices

Introduction

In an ever-volatile market where every dollar counts, the used car market represents a critical sector of the consumer industry. With rising consumer demand and an increasing variety of vehicles entering the secondary market, understanding the factors that influence used car prices is essential. Buyers seek to make informed decisions based on value for money, while sellers aim to maximize returns by accurately pricing their vehicles. Bridging this gap requires a data-driven approach to uncover the relationships between vehicle specifications and market pricing.

This project utilizes the “Vehicle dataset” by Nehal Birla, Nishant Verma, and Nikhil Kushwaha, available on Kaggle

This dataset aggregates detailed information on over 10,000 used cars, including key variables such as fuel type, transmission, engine capacity, mileage, and kilometers driven, alongside categorical variables like seller type, ownership history, and geographic location.

Our analysis is driven by three core objectives:

  • To model the relationship between vehicle specifications (e.g., mileage, transmission, and fuel type) and their pricing.
  • To identify regional trends and seller-specific factors influencing market prices.
  • To evaluate how performance metrics, such as engine power and fuel efficiency, impact purchasing behavior.

Using statistical modeling techniques, including regression analysis, we aim to deliver a robust and interpretable model that not only predicts car prices but also highlights the most influential factors driving price variations. Our results will shed light on market dynamics, offering actionable insights for both consumers and industry professionals navigating this volatile space.

By the conclusion of this project, we aim to provide a detailed analysis that enhances understanding of the used car market, aiding stakeholders in making informed decisions in an ever-changing economic landscape.

The dataset contains the following attributes

Attributes Description
1 Name The brand name of the vehicle (e.g., Hyundai i10, Honda City).
2 Year The year the vehicle was manufactured.
3 Selling Price The selling price of the vehicle in INR.
4 Km Driven The total distance the vehicle has been driven.
5 Fuel The type of fuel used by the vehicle (e.g Petrol, Diesel)
6 Seller Type The type of seller (e.g Individual, Dealer).
7 Transmission The type of transmission system (e.g Manual, Automatic).
8 Owner The number of previous owners of the vehicle (e.g., First Owner, Second Owner).
9 Mileage The fuel efficiency of the vehicle.(in km/l or km/kg)
10 Engine The engine displacement capacity.(in CC)
11 Max Power The maximum power output of the vehicle’s engine measured in horsepower.(in HP).
12 Torque The maximum torque.
13 Seats The seating capacity of the vehicle (e.g 5-seater, 7-seater).

The above attributes can be categorized into Numeric and Categorical variables, with selling price as continuous numeric response variable.

Attribute Type Attribute
Categorical Variables Name, Fuel, Seller Type, Transmission, Owner
Discrete Numeric Variables Year, Km Driven, Seats
Continuous Numeric Variables Selling Price, Mileage, Engine, Max Power, Torque

Load dataset

car_details = read_csv("Car details v3.csv")

Total number of observations

nrow(car_details)
## [1] 8128

Excluding missing values from dataset

car_details = na.omit(car_details)
nrow(car_details)
## [1] 7906

Excluding duplicate rows from dataset

sum(duplicated(car_details))
## [1] 1189
car_details = car_details[!duplicated(car_details),]
nrow(car_details)
## [1] 6717

Sample data from Vehicle’s dataset

head(car_details)
## # A tibble: 6 × 13
##   name         year selling_price km_driven fuel  seller_type transmission owner
##   <chr>       <dbl>         <dbl>     <dbl> <chr> <chr>       <chr>        <chr>
## 1 Maruti Swi…  2014        450000    145500 Dies… Individual  Manual       Firs…
## 2 Skoda Rapi…  2014        370000    120000 Dies… Individual  Manual       Seco…
## 3 Honda City…  2006        158000    140000 Petr… Individual  Manual       Thir…
## 4 Hyundai i2…  2010        225000    127000 Dies… Individual  Manual       Firs…
## 5 Maruti Swi…  2007        130000    120000 Petr… Individual  Manual       Firs…
## 6 Hyundai Xc…  2017        440000     45000 Petr… Individual  Manual       Firs…
## # ℹ 5 more variables: mileage <chr>, engine <chr>, max_power <chr>,
## #   torque <chr>, seats <dbl>

Data cleaning

# Extract the first word from "name" - to get the "make" of the vehicle
car_details$name = word(car_details$name,1)
# Rename name to make
colnames(car_details)[1] = "make"
# Change datatypes of "make" from character to factor
car_details$make = as.factor(car_details$make)

# Extract the numeric value from mileage
car_details$mileage = word(car_details$mileage,1)
#Change data_type of "mileage" from character to numeric
car_details$mileage=as.numeric(car_details$mileage)

# Extract the numeric value from engine
car_details$engine = word(car_details$engine,1)
#Change data_type of "engine" from character to numeric
car_details$engine=as.numeric(car_details$engine)

# Extract the numeric value from max_power
car_details$max_power = word(car_details$max_power,1)
#Change data_type of "max_power" from character to numeric
car_details$max_power=as.numeric(car_details$max_power)

# Change datatypes of "fuel", "seller_type", "transmission","owner" 
# From character to factor
car_details$fuel=as.factor(car_details$fuel)
car_details$seller_type=as.factor(car_details$seller_type)
car_details$transmission=as.factor(car_details$transmission)
car_details$owner=as.factor(car_details$owner)
#colnames(car_details)[7]="AM"

##KM driven in ten thousands
car_details$km_driven = car_details$km_driven/10000
#Rename  km_driven
colnames(car_details)[4]="km_driven_in_10k"

#Dropping column torque
car_details = car_details[,!names(car_details) %in% "torque"]

#Selling price in ten thousands
car_details$selling_price = car_details$selling_price/10000
#Rename Selling price
colnames(car_details)[3]="selling_price_in_10k"

Sample data

head(car_details)
## # A tibble: 6 × 12
##   make     year selling_price_in_10k km_driven_in_10k fuel   seller_type
##   <fct>   <dbl>                <dbl>            <dbl> <fct>  <fct>      
## 1 Maruti   2014                 45               14.6 Diesel Individual 
## 2 Skoda    2014                 37               12   Diesel Individual 
## 3 Honda    2006                 15.8             14   Petrol Individual 
## 4 Hyundai  2010                 22.5             12.7 Diesel Individual 
## 5 Maruti   2007                 13               12   Petrol Individual 
## 6 Hyundai  2017                 44                4.5 Petrol Individual 
## # ℹ 6 more variables: transmission <fct>, owner <fct>, mileage <dbl>,
## #   engine <dbl>, max_power <dbl>, seats <dbl>

Unique car brands

unique(car_details$make)
##  [1] Maruti        Skoda         Honda         Hyundai       Toyota       
##  [6] Ford          Renault       Mahindra      Tata          Chevrolet    
## [11] Datsun        Jeep          Mercedes-Benz Mitsubishi    Audi         
## [16] Volkswagen    BMW           Nissan        Lexus         Jaguar       
## [21] Land          MG            Volvo         Daewoo        Kia          
## [26] Fiat          Force         Ambassador    Ashok         Isuzu        
## [31] Opel         
## 31 Levels: Ambassador Ashok Audi BMW Chevrolet Daewoo Datsun Fiat ... Volvo

Creating new varaible make_category

  • From the above result we can see that there are 31 unique values for “make”.
  • When we model the data using the independent variable “make”, there will be at least 30 dummy variables as predictors.
  • To reduce the model complexity and to increase interpretability, the car make can be grouped into broader categories as “Budget”, “Mid-Range” or “Luxury” depending on general market perception.
car_details$make_category = ifelse(car_details$make %in% 
                                     c("Maruti", "Tata", "Mahindra", "Datsun", 
                                        "Renault", "Chevrolet", "Fiat", 
                                         "Daewoo", "Ambassador", "Ashok"), 
                                    "Budget",
                             ifelse(car_details$make %in% 
                                      c("Honda", "Ford", "Hyundai","Toyota", "Volkswagen","Nissan", 
                                        "Skoda", "Mitsubishi","Force", "Kia","MG"), 
                                    "Midrange", 
                                    "Luxury"))
car_details$make_category = as.factor(car_details$make_category)
levels(car_details$make_category)
## [1] "Budget"   "Luxury"   "Midrange"

Final structure of the dataset car_details

structure_car_details = data.frame(
  Column = names(car_details),
  Type = unname(sapply(car_details, class)))
print(structure_car_details)
##                  Column    Type
## 1                  make  factor
## 2                  year numeric
## 3  selling_price_in_10k numeric
## 4      km_driven_in_10k numeric
## 5                  fuel  factor
## 6           seller_type  factor
## 7          transmission  factor
## 8                 owner  factor
## 9               mileage numeric
## 10               engine numeric
## 11            max_power numeric
## 12                seats numeric
## 13        make_category  factor

Data analysis

car_details$selling_price_in_10k[car_details$selling_price_in_10k > 720]
## [1] 1000
car_details=subset(car_details,subset = car_details$selling_price_in_10k < 720,)
  • There is one observation, which is quiet different from the general pattern of selling price. This can impact the model that we build, hence excluded one observation with selling_price = 1000.

Selling Price distribution

hist(car_details$selling_price_in_10k,xlab="Selling Price (in Ten Thousands)",
     main="Selling Price distribution",
     col = "lightblue")

  • The above plot is positively skewed, meaning the selling price for most of the observations are less than or equal to 200 and there are very less number of observations are above 20,00,000 INR.

Frequency Distribution of categorical variables

make_counts = table(car_details$make)
fuel_type_car_count = table(car_details$fuel)
seller_type_car_count = table(car_details$seller_type)
trans_type_car_count = table(car_details$transmission)
owner_type_car_count = table(car_details$owner)


par(mfrow = c(2, 3))
# Plot 1: Number of cars by make
barplot(sort(make_counts),horiz=TRUE, las = 1,
        xlab = "Number of Cars",
        ylab = "Car Make",
        col = "lightblue",
        cex.names = 0.5,
        main = "Num. of Cars in Each Make")

# Plot 2: Number of cars by fuel type
barplot(sort(fuel_type_car_count), horiz = TRUE, las = 1, cex.names = 0.9,
        col = "lightblue", main = "Num. of Cars in Each Fuel Type")

# Plot 3: Number of cars by seller type
barplot(sort(seller_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
        col = "lightblue", main = "Num. of Cars in Each Seller Type")

# Plot 4: Number of cars by trans type
barplot(sort(trans_type_car_count), horiz = TRUE, las = 1, cex.names = 0.8,
        col = "lightblue", main = "Num. of Cars in Each Trans Type")

# Plot 5: Number of cars by owner type
barplot(sort(owner_type_car_count), horiz = TRUE, las = 1, cex.names = 0.6,
        col = "lightblue", main = "Num. of Cars in Each Owner Type")

Box-plots of categorical varaibles vs selling price

Selling Price based on make

boxplot(selling_price_in_10k ~ make, data = car_details,col=rainbow(length(unique(car_details$make))),
        las = 2,cex.axis = 0.7,               
        main = "Car Make Vs Selling Price",
        xlab = "Car Make",
        ylab = "Selling Price (in 10k)")

  • The above boxplot indicates a significant variation in selling prices across different car makes. Some brands have a much wider price range than others.
  • Example Mercedes-Benz, BMW, Jaguar,Land Rover and Volvo have the highest median selling prices.
  • While cars like Tata, Maruti, and Daewoo have the lowest median selling prices
par(mfrow = c(1, 2))
boxplot(selling_price_in_10k ~ fuel, data=car_details,cex.axis = 0.8,
        col=rainbow(length(unique(car_details$fuel))),
     main="SellingPrice based on fuel type")
plot(selling_price_in_10k ~ seller_type, data=car_details,cex.axis = 0.6,
     col=rainbow(length(unique(car_details$seller_type))),
     main="SellingPrice based on seller type")

  • Diesel powered cars have highest median selling price followed by petrol powered and CNG or LPG cars.
  • The Individual seller type cars have lowest median cost compared to Dealers.
par(mfrow = c(1, 2))
plot(selling_price_in_10k ~ transmission, data=car_details,
     col=rainbow(length(unique(car_details$transmission))),
     main="SellingPrice based on trans. type")
plot(selling_price_in_10k ~ owner, data=car_details,las=2,cex.axis = 0.5,
     col=rainbow(length(unique(car_details$owner))),
     main="SellingPrice based on owner type")

  • We can see that there is a slight difference in the selling price of the Automatic and Manual transmission cars. The median cost of Automatic transmission cars are higher than Manual transmission cars.
  • The selling price of cars across different owner types are significantly different. The median price of test drive cars are very high and the rest of the owner types have low median cost

Scatter plot of numerical varaibles vs selling price

colours = ifelse(car_details$transmission == "Automatic", "blue", "red")
plot(selling_price_in_10k ~ year,data=car_details,col=colours,pch=19,
     main = "Selling Price based on Year (with Transmission Type)",
     xlab = "Year",
     ylab = "Selling Price (in Ten Thousands)")
legend("topleft", legend = c("Automatic", "Manual"),
       col = c("blue", "red"), pch = 19)

  • We can see a positive correlation between year and selling price. As the year increases, the selling price is increasing.
  • This suggests that newer cars are priced higher than older ones.
  • We can also see that automatic transmission cars have a higher selling price across all years.
colours = c("Budget" = "blue", "Midrange"="green","Luxury"= "orange")
plot(selling_price_in_10k ~ year,data=car_details,
     col = colours[car_details$make_category],pch=19,
     main = "Selling Price based on Year (with make_category)",
     xlab = "Year",
     ylab = "Selling Price (in Ten Thousands)")
legend("topleft", legend = c("Budget","Midrange","Luxury"),
       col = c("blue", "green","orange"), pch = 19)

  • Similar to previous plot, there is a positive correlation between year and selling price.
  • We see that cost of Budget cars, mid range cars and luxury car increases with Year.
par(mfrow = c(1, 2))
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
     xlab="KM_Driven((in Ten Thousand)",
     ylab="Selling_Price(in Ten Thousand)",
     main="Price based on km_driven")
car_details = subset(car_details,car_details$km_driven_in_10k < 100,)
plot(selling_price_in_10k ~ km_driven_in_10k, data=car_details,
     xlab="KM_Driven((in Ten Thousand)",
     ylab="Selling_Price(in Ten Thousand)",
     main="After removing extreme values")

  • The relationship between selling_price and km_driven doesn’t seem to be strongly linear
  • But we can see that as the km_driven increases, the selling price remains in low range.
  • There are 2 observations which are different from the general pattern with values of KM Driven(150.0000 236.0457). This can been seen in the above plot left side.
  • These observations can impact the model. Hence those two data points are excluded.
plot(selling_price_in_10k ~ mileage, data=car_details,
     xlab="Mileage",
     ylab="Selling_Price(in Ten Thousand)",
     main="Price based on Mileage")

  • Most data points are clustered at mileage values 10 to 30, and there doesn’t seem to be a linear relationship.
  • That is higher mileage doesn’t indicate higher selling price.

  • From the “Price based on engine” plot, we can see that the selling price is high for higher engine power
  • From the “Price based on max_power” plot, we can see that the there is a linear relationship between max_power and selling price.
  • When the maximum power increases, the selling price of the car is increasing

Conclusion of Data analysis

  • The selling price distribution is positively skewed. Positively skewed data has extreme values which makes it hard to fit models.
  • Hence Logarithmic transformation can make the selling price distribution to be normally distributed.
  • Logarithmic transformations can also help stabilize the variance, making the data more homoscedastic and suitable for analysis.
  • There is a positive correlation between year and selling price.
  • Cost of Budget cars, mid range cars and luxury car increases with Year.
  • As the km_driven increases, the selling price tends to decrease.
  • There is no impact of mileage on selling price
  • The engine and max_power also tends to increase with selling price.
  • The median cost of Automatic transmission cars are higher than Manual transmission cars.
  • The median price of test drive cars are very high when compared to other owner types

Correlation of the numeric variables

pairs(selling_price_in_10k ~ year+km_driven_in_10k+mileage+engine+max_power+seats ,data=car_details)

cor_mat = cor(car_details[,sapply(car_details,is.numeric)])
cor_mat
##                             year selling_price_in_10k km_driven_in_10k
## year                  1.00000000            0.4383902      -0.45453560
## selling_price_in_10k  0.43839024            1.0000000      -0.19714853
## km_driven_in_10k     -0.45453560           -0.1971485       1.00000000
## mileage               0.36669288           -0.1257948      -0.23116300
## engine               -0.02073416            0.4510495       0.30389598
## max_power             0.15893720            0.6872307       0.04747428
## seats                 0.02483339            0.1656998       0.24746651
##                         mileage      engine   max_power       seats
## year                  0.3666929 -0.02073416  0.15893720  0.02483339
## selling_price_in_10k -0.1257948  0.45104953  0.68723067  0.16569977
## km_driven_in_10k     -0.2311630  0.30389598  0.04747428  0.24746651
## mileage               1.0000000 -0.58120564 -0.39006970 -0.45881198
## engine               -0.5812056  1.00000000  0.68630272  0.65884340
## max_power            -0.3900697  0.68630272  1.00000000  0.26214681
## seats                -0.4588120  0.65884340  0.26214681  1.00000000
high_cor = cor_mat[cor_mat > 0.5 & cor_mat != 1]
high_cor_indices = which(cor_mat > 0.5 & cor_mat != 1, arr.ind = TRUE)
high_cor_df = data.frame(
  row = rownames(cor_mat)[high_cor_indices[, 1]],
  column = colnames(cor_mat)[high_cor_indices[, 2]],
  correlation = high_cor)
head(high_cor_df, 3)
##         row               column correlation
## 1 max_power selling_price_in_10k   0.6872307
## 2 max_power               engine   0.6863027
## 3     seats               engine   0.6588434

Split data into train and test

set.seed(125)  
train_indices = sample(nrow(car_details), size = 0.80 * nrow(car_details))
train_data = car_details[train_indices, ]
test_data = car_details[-train_indices, ]
nrow(train_data)
## [1] 5370
nrow(test_data)
## [1] 1343

Additive Full model without “Make”

full_model = lm(selling_price_in_10k ~ .-make, data = train_data)
summary(full_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ . - make, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -206.057  -10.898   -1.291    8.093  313.265 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -7.227e+03  2.707e+02 -26.697  < 2e-16 ***
## year                         3.591e+00  1.354e-01  26.518  < 2e-16 ***
## km_driven_in_10k            -9.175e-01  9.513e-02  -9.645  < 2e-16 ***
## fuelDiesel                   8.676e+00  4.266e+00   2.034  0.04204 *  
## fuelLPG                      1.168e+01  6.595e+00   1.771  0.07666 .  
## fuelPetrol                   6.844e-01  4.289e+00   0.160  0.87322    
## seller_typeIndividual       -8.027e+00  1.283e+00  -6.257 4.23e-10 ***
## seller_typeTrustmark Dealer -4.045e+00  5.454e+00  -0.742  0.45833    
## transmissionManual          -1.569e+01  1.577e+00  -9.947  < 2e-16 ***
## ownerFourth & Above Owner   -2.610e+00  2.483e+00  -1.051  0.29315    
## ownerSecond Owner           -5.009e+00  8.999e-01  -5.566 2.73e-08 ***
## ownerTest Drive Car          2.425e+02  1.199e+01  20.222  < 2e-16 ***
## ownerThird Owner            -4.364e+00  1.542e+00  -2.831  0.00466 ** 
## mileage                      5.343e-02  1.598e-01   0.334  0.73805    
## engine                       1.089e-02  1.755e-03   6.206 5.84e-10 ***
## max_power                    5.040e-01  2.053e-02  24.546  < 2e-16 ***
## seats                        1.646e+00  5.983e-01   2.750  0.00598 ** 
## make_categoryLuxury          1.101e+02  3.120e+00  35.299  < 2e-16 ***
## make_categoryMidrange        2.385e+00  8.360e-01   2.853  0.00435 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.51 on 5351 degrees of freedom
## Multiple R-squared:  0.7247, Adjusted R-squared:  0.7238 
## F-statistic: 782.6 on 18 and 5351 DF,  p-value: < 2.2e-16
par(mfrow = c(1, 2))
plot(full_model,which=c(1,2))

vif(full_model)
##                      GVIF Df GVIF^(1/(2*Df))
## year             2.126959  1        1.458410
## km_driven_in_10k 1.639337  1        1.280366
## fuel             2.494514  3        1.164567
## seller_type      1.130152  2        1.031061
## transmission     1.447493  1        1.203118
## owner            1.401437  4        1.043090
## mileage          3.166319  1        1.779415
## engine           5.599938  1        2.366419
## max_power        3.094793  1        1.759202
## seats            2.537822  1        1.593054
## make_category    1.962517  2        1.183596

BIC of Full model | Both direction

n= nrow(train_data)
bic_full_model = step(full_model,direction="both",k=log(n),trace=0)
summary(bic_full_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ year + km_driven_in_10k + 
##     fuel + seller_type + transmission + owner + engine + max_power + 
##     make_category, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -206.583  -10.943   -1.313    8.151  313.568 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -7.370e+03  2.431e+02 -30.316  < 2e-16 ***
## year                         3.665e+00  1.204e-01  30.440  < 2e-16 ***
## km_driven_in_10k            -8.999e-01  9.496e-02  -9.476  < 2e-16 ***
## fuelDiesel                   9.099e+00  4.263e+00   2.134   0.0329 *  
## fuelLPG                      1.224e+01  6.564e+00   1.864   0.0623 .  
## fuelPetrol                   1.136e+00  4.257e+00   0.267   0.7895    
## seller_typeIndividual       -7.968e+00  1.283e+00  -6.211 5.66e-10 ***
## seller_typeTrustmark Dealer -4.302e+00  5.456e+00  -0.788   0.4305    
## transmissionManual          -1.537e+01  1.572e+00  -9.778  < 2e-16 ***
## ownerFourth & Above Owner   -2.449e+00  2.483e+00  -0.986   0.3240    
## ownerSecond Owner           -5.020e+00  9.003e-01  -5.576 2.59e-08 ***
## ownerTest Drive Car          2.426e+02  1.200e+01  20.226  < 2e-16 ***
## ownerThird Owner            -4.405e+00  1.542e+00  -2.856   0.0043 ** 
## engine                       1.345e-02  1.198e-03  11.226  < 2e-16 ***
## max_power                    4.937e-01  2.013e-02  24.530  < 2e-16 ***
## make_categoryLuxury          1.083e+02  3.051e+00  35.492  < 2e-16 ***
## make_categoryMidrange        1.661e+00  7.945e-01   2.091   0.0366 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.52 on 5353 degrees of freedom
## Multiple R-squared:  0.7243, Adjusted R-squared:  0.7235 
## F-statistic:   879 on 16 and 5353 DF,  p-value: < 2.2e-16
plot(bic_full_model,which=c(1,2))

vif(bic_full_model)
##                      GVIF Df GVIF^(1/(2*Df))
## year             1.679628  1        1.296005
## km_driven_in_10k 1.632010  1        1.277502
## fuel             1.635248  3        1.085419
## seller_type      1.128210  2        1.030617
## transmission     1.436714  1        1.198630
## owner            1.399679  4        1.042926
## engine           2.609471  1        1.615386
## max_power        2.970440  1        1.723497
## make_category    1.741865  2        1.148824

Small model with interaction

small_model = lm(selling_price_in_10k ~ make_category*year*max_power + year*transmission + owner + seller_type + km_driven_in_10k+mileage, data = train_data)
summary(small_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ make_category * year * max_power + 
##     year * transmission + owner + seller_type + km_driven_in_10k + 
##     mileage, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -168.924   -9.112   -1.273    7.930  215.795 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           2.745e+03  8.799e+02   3.120  0.00182 ** 
## make_categoryLuxury                   8.824e+04  6.964e+03  12.671  < 2e-16 ***
## make_categoryMidrange                 8.867e+03  1.084e+03   8.179 3.55e-16 ***
## year                                 -1.356e+00  4.369e-01  -3.105  0.00191 ** 
## max_power                            -1.664e+02  7.047e+00 -23.613  < 2e-16 ***
## transmissionManual                    1.795e+03  6.995e+02   2.566  0.01033 *  
## ownerFourth & Above Owner            -1.338e+00  1.850e+00  -0.723  0.46968    
## ownerSecond Owner                    -2.834e+00  6.712e-01  -4.222 2.46e-05 ***
## ownerTest Drive Car                   1.651e+02  9.013e+00  18.323  < 2e-16 ***
## ownerThird Owner                     -2.958e+00  1.149e+00  -2.575  0.01005 *  
## seller_typeIndividual                -2.473e+00  9.598e-01  -2.577  0.01000 *  
## seller_typeTrustmark Dealer          -4.393e+00  4.055e+00  -1.083  0.27871    
## km_driven_in_10k                      1.602e-01  6.538e-02   2.450  0.01430 *  
## mileage                              -3.380e-01  8.386e-02  -4.031 5.63e-05 ***
## make_categoryLuxury:year             -4.386e+01  3.456e+00 -12.689  < 2e-16 ***
## make_categoryMidrange:year           -4.417e+00  5.385e-01  -8.203 2.90e-16 ***
## make_categoryLuxury:max_power        -6.323e+02  3.692e+01 -17.126  < 2e-16 ***
## make_categoryMidrange:max_power      -8.920e+01  1.201e+01  -7.425 1.30e-13 ***
## year:max_power                        8.290e-02  3.500e-03  23.683  < 2e-16 ***
## year:transmissionManual              -8.944e-01  3.470e-01  -2.577  0.00999 ** 
## make_categoryLuxury:year:max_power    3.145e-01  1.833e-02  17.159  < 2e-16 ***
## make_categoryMidrange:year:max_power  4.446e-02  5.966e-03   7.452 1.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.73 on 5348 degrees of freedom
## Multiple R-squared:  0.8476, Adjusted R-squared:  0.847 
## F-statistic:  1416 on 21 and 5348 DF,  p-value: < 2.2e-16
plot(small_model,which=c(1,2))

vif(small_model,type = "predictor")
## GVIFs computed for predictors
##                       GVIF Df GVIF^(1/(2*Df))
## make_category    52.058575 11        1.196802
## year              2.774981 13        1.040036
## max_power        52.058575 11        1.196802
## transmission     28.730113  3        1.750073
## owner             1.441031  4        1.046729
## seller_type       1.139833  2        1.033262
## km_driven_in_10k  1.398158  1        1.182437
## mileage           1.574993  1        1.254987
##                                          Interacts With
## make_category                           year, max_power
## year             make_category, max_power, transmission
## max_power                           make_category, year
## transmission                                       year
## owner                                              --  
## seller_type                                        --  
## km_driven_in_10k                                   --  
## mileage                                            --  
##                                                                                      Other Predictors
## make_category                             transmission, owner, seller_type, km_driven_in_10k, mileage
## year                                                    owner, seller_type, km_driven_in_10k, mileage
## max_power                                 transmission, owner, seller_type, km_driven_in_10k, mileage
## transmission                  make_category, max_power, owner, seller_type, km_driven_in_10k, mileage
## owner            make_category, year, max_power, transmission, seller_type, km_driven_in_10k, mileage
## seller_type            make_category, year, max_power, transmission, owner, km_driven_in_10k, mileage
## km_driven_in_10k            make_category, year, max_power, transmission, owner, seller_type, mileage
## mileage            make_category, year, max_power, transmission, owner, seller_type, km_driven_in_10k

BIC of Small model | Both direction

n= nrow(train_data)
bic_small_model = step(small_model,direction="both",k=log(n),trace=0)
summary(bic_small_model)
## 
## Call:
## lm(formula = selling_price_in_10k ~ make_category + year + max_power + 
##     transmission + owner + mileage + make_category:year + make_category:max_power + 
##     year:max_power + make_category:year:max_power, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -168.318   -9.151   -1.366    7.847  213.414 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           4.597e+03  5.197e+02   8.846  < 2e-16 ***
## make_categoryLuxury                   8.807e+04  6.960e+03  12.654  < 2e-16 ***
## make_categoryMidrange                 9.576e+03  1.069e+03   8.961  < 2e-16 ***
## year                                 -2.276e+00  2.584e-01  -8.809  < 2e-16 ***
## max_power                            -1.660e+02  6.965e+00 -23.834  < 2e-16 ***
## transmissionManual                   -8.056e+00  1.174e+00  -6.859 7.72e-12 ***
## ownerFourth & Above Owner            -1.359e+00  1.849e+00  -0.735  0.46257    
## ownerSecond Owner                    -2.828e+00  6.664e-01  -4.244 2.23e-05 ***
## ownerTest Drive Car                   1.668e+02  9.009e+00  18.518  < 2e-16 ***
## ownerThird Owner                     -2.965e+00  1.145e+00  -2.590  0.00963 ** 
## mileage                              -3.298e-01  8.383e-02  -3.934 8.47e-05 ***
## make_categoryLuxury:year             -4.377e+01  3.455e+00 -12.671  < 2e-16 ***
## make_categoryMidrange:year           -4.769e+00  5.308e-01  -8.985  < 2e-16 ***
## make_categoryLuxury:max_power        -6.406e+02  3.692e+01 -17.349  < 2e-16 ***
## make_categoryMidrange:max_power      -9.691e+01  1.184e+01  -8.182 3.47e-16 ***
## year:max_power                        8.271e-02  3.460e-03  23.905  < 2e-16 ***
## make_categoryLuxury:year:max_power    3.186e-01  1.833e-02  17.383  < 2e-16 ***
## make_categoryMidrange:year:max_power  4.829e-02  5.883e-03   8.209 2.77e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.76 on 5352 degrees of freedom
## Multiple R-squared:  0.8471, Adjusted R-squared:  0.8466 
## F-statistic:  1744 on 17 and 5352 DF,  p-value: < 2.2e-16
plot(bic_small_model,which=c(1,2))

vif(small_model,type = "predictor")
## GVIFs computed for predictors
##                       GVIF Df GVIF^(1/(2*Df))
## make_category    52.058575 11        1.196802
## year              2.774981 13        1.040036
## max_power        52.058575 11        1.196802
## transmission     28.730113  3        1.750073
## owner             1.441031  4        1.046729
## seller_type       1.139833  2        1.033262
## km_driven_in_10k  1.398158  1        1.182437
## mileage           1.574993  1        1.254987
##                                          Interacts With
## make_category                           year, max_power
## year             make_category, max_power, transmission
## max_power                           make_category, year
## transmission                                       year
## owner                                              --  
## seller_type                                        --  
## km_driven_in_10k                                   --  
## mileage                                            --  
##                                                                                      Other Predictors
## make_category                             transmission, owner, seller_type, km_driven_in_10k, mileage
## year                                                    owner, seller_type, km_driven_in_10k, mileage
## max_power                                 transmission, owner, seller_type, km_driven_in_10k, mileage
## transmission                  make_category, max_power, owner, seller_type, km_driven_in_10k, mileage
## owner            make_category, year, max_power, transmission, seller_type, km_driven_in_10k, mileage
## seller_type            make_category, year, max_power, transmission, owner, km_driven_in_10k, mileage
## km_driven_in_10k            make_category, year, max_power, transmission, owner, seller_type, mileage
## mileage            make_category, year, max_power, transmission, owner, seller_type, km_driven_in_10k
bptest(bic_small_model)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_small_model
## BP = 1694.2, df = 17, p-value < 2.2e-16

ANOVA of BIC Small model and BIC Full model

anova(bic_small_model,bic_full_model)
## Analysis of Variance Table
## 
## Model 1: selling_price_in_10k ~ make_category + year + max_power + transmission + 
##     owner + mileage + make_category:year + make_category:max_power + 
##     year:max_power + make_category:year:max_power
## Model 2: selling_price_in_10k ~ year + km_driven_in_10k + fuel + seller_type + 
##     transmission + owner + engine + max_power + make_category
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1   5352 2089387                                  
## 2   5353 3766202 -1  -1676815 4295.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Influential points

indicies_to_exclude = unname(which(cooks.distance(bic_small_model) > 4/length(cooks.distance(bic_small_model))))
sum(cooks.distance(bic_small_model) > 4/length(cooks.distance(bic_small_model)))
## [1] 236
train_data_filtered = train_data[-indicies_to_exclude,]

Refitting the BIC Small model without influential points

n= nrow(train_data_filtered)
refitted_wip = lm(log(selling_price_in_10k) ~ .+ make_category*year+ make_category*max_power -make-seats-engine, data = train_data_filtered)
                    #make_category*year*max_power + transmission + owner + seller_type + km_driven_in_10k+mileage, data = train_data_filtered)
summary(refitted_wip)
## 
## Call:
## lm(formula = log(selling_price_in_10k) ~ . + make_category * 
##     year + make_category * max_power - make - seats - engine, 
##     data = train_data_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37471 -0.16777  0.01752  0.19196  1.06812 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -2.340e+02  3.332e+00 -70.227  < 2e-16 ***
## year                             1.177e-01  1.660e-03  70.898  < 2e-16 ***
## km_driven_in_10k                -2.605e-03  1.087e-03  -2.396 0.016595 *  
## fuelDiesel                       2.903e-01  4.701e-02   6.176 7.07e-10 ***
## fuelLPG                          1.244e-01  7.234e-02   1.720 0.085480 .  
## fuelPetrol                       1.360e-02  4.699e-02   0.289 0.772240    
## seller_typeIndividual           -5.741e-02  1.481e-02  -3.876 0.000107 ***
## seller_typeTrustmark Dealer      6.331e-02  6.125e-02   1.034 0.301410    
## transmissionManual              -6.795e-02  1.876e-02  -3.621 0.000296 ***
## ownerFourth & Above Owner       -1.569e-01  2.826e-02  -5.552 2.97e-08 ***
## ownerSecond Owner               -7.787e-02  1.009e-02  -7.719 1.40e-14 ***
## ownerThird Owner                -1.092e-01  1.729e-02  -6.317 2.90e-10 ***
## mileage                         -8.665e-03  1.354e-03  -6.398 1.71e-10 ***
## max_power                        9.532e-03  2.705e-04  35.237  < 2e-16 ***
## make_categoryLuxury             -4.440e+01  4.255e+01  -1.043 0.296825    
## make_categoryMidrange            5.666e+00  4.655e+00   1.217 0.223643    
## year:make_categoryLuxury         2.272e-02  2.116e-02   1.074 0.283062    
## year:make_categoryMidrange      -2.834e-03  2.314e-03  -1.225 0.220648    
## max_power:make_categoryLuxury   -5.158e-03  2.729e-03  -1.890 0.058795 .  
## max_power:make_categoryMidrange  1.403e-03  3.718e-04   3.774 0.000163 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2913 on 5114 degrees of freedom
## Multiple R-squared:  0.8239, Adjusted R-squared:  0.8233 
## F-statistic:  1260 on 19 and 5114 DF,  p-value: < 2.2e-16
bic_refitted_wip = step(refitted_wip,direction="both",k=log(n),trace=0)
summary(bic_refitted_wip)
## 
## Call:
## lm(formula = log(selling_price_in_10k) ~ year + fuel + seller_type + 
##     transmission + owner + mileage + max_power + make_category, 
##     data = train_data_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.38778 -0.16910  0.01772  0.19303  1.05425 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -2.330e+02  2.782e+00 -83.731  < 2e-16 ***
## year                         1.171e-01  1.388e-03  84.420  < 2e-16 ***
## fuelDiesel                   2.747e-01  4.690e-02   5.858 4.99e-09 ***
## fuelLPG                      1.116e-01  7.235e-02   1.543    0.123    
## fuelPetrol                   1.139e-02  4.686e-02   0.243    0.808    
## seller_typeIndividual       -6.386e-02  1.473e-02  -4.335 1.49e-05 ***
## seller_typeTrustmark Dealer  6.101e-02  6.134e-02   0.995    0.320    
## transmissionManual          -7.852e-02  1.863e-02  -4.214 2.56e-05 ***
## ownerFourth & Above Owner   -1.610e-01  2.828e-02  -5.694 1.31e-08 ***
## ownerSecond Owner           -7.905e-02  1.007e-02  -7.853 4.93e-15 ***
## ownerThird Owner            -1.109e-01  1.728e-02  -6.417 1.51e-10 ***
## mileage                     -7.517e-03  1.325e-03  -5.673 1.48e-08 ***
## max_power                    1.010e-02  2.205e-04  45.817  < 2e-16 ***
## make_categoryLuxury          3.766e-01  6.430e-02   5.857 5.02e-09 ***
## make_categoryMidrange        7.974e-02  8.849e-03   9.011  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2918 on 5119 degrees of freedom
## Multiple R-squared:  0.8231, Adjusted R-squared:  0.8226 
## F-statistic:  1701 on 14 and 5119 DF,  p-value: < 2.2e-16
vif(bic_refitted_wip,type = "predictor")
## GVIFs computed for predictors
##                   GVIF Df GVIF^(1/(2*Df)) Interacts With
## year          1.722679  1        1.312509           --  
## fuel          1.358473  3        1.052386           --  
## seller_type   1.087360  2        1.021159           --  
## transmission  1.194923  1        1.093125           --  
## owner         1.351413  3        1.051473           --  
## mileage       1.635368  1        1.278815           --  
## max_power     1.902386  1        1.379270           --  
## make_category 1.274009  2        1.062412           --  
##                                                                        Other Predictors
## year          fuel, seller_type, transmission, owner, mileage, max_power, make_category
## fuel          year, seller_type, transmission, owner, mileage, max_power, make_category
## seller_type          year, fuel, transmission, owner, mileage, max_power, make_category
## transmission          year, fuel, seller_type, owner, mileage, max_power, make_category
## owner          year, fuel, seller_type, transmission, mileage, max_power, make_category
## mileage          year, fuel, seller_type, transmission, owner, max_power, make_category
## max_power          year, fuel, seller_type, transmission, owner, mileage, make_category
## make_category          year, fuel, seller_type, transmission, owner, mileage, max_power
plot(bic_refitted_wip,which=c(1,2))

bptest(bic_refitted_wip)
## 
##  studentized Breusch-Pagan test
## 
## data:  bic_refitted_wip
## BP = 574.66, df = 14, p-value < 2.2e-16